This is data that the city of Boulder keeps open to the public. They maintain ~30 data sets for various different areas of interest such as crime, homelessness, and city affairs. This analysis is leveraging the data set, 2019 Council Emails Dataset:
Data set containing 2019 emails to council@bouldercolorado.gov. This version works best for viewing in Excel, as it includes only the plain text version of the emails. This file is updated daily with new emails.
raw_data <- read_csv("http://www-static.bouldercolorado.gov/docs/opendata/CouncilEmails_PlainText2019.csv")
## Parsed with column specification:
## cols(
## SentFrom = col_character(),
## SentTo = col_character(),
## SentCC = col_character(),
## ReceivedDate = col_character(),
## EmailSubject = col_character(),
## PlainTextBody = col_character(),
## MessageIdentifier = col_character()
## )
# raw_data %>%
# datatable()
This data set has the following columns : SentFrom, SentTo, SentCC, ReceivedDate, EmailSubject, PlainTextBody, MessageIdentifier. They are relatively self explanatory. One item to note is that all emails included in this data set are sent to council@bouldercolorado.gov. They may also be sent to others such as boulderplanningboard or specific city council members.
These data are a little messy. Not bad in the grand scheme of data messiness but we need to remove some junk if we want any reasonable results. I have created a table of what I removed and why:
removed <-
list()
removed$no_reply <-
raw_data %>%
filter(SentFrom == "No Reply") # these are spam
removed_rows <-
bind_rows(removed, .id = "removal_reason")
city_council_members <-
c("Jones, Suzanne",
"Weaver, Sam",
"Brockett, Aaron",
"Yates, Phillip",
"Carlisle, Cynthia",
"Grano, Jill",
"Morzel, Lisa",
"Young, Mary",
"Nagle, Mirabai")
city_officials <-
c("Aulabaugh, Shannon", # police spokesperson
"Brautigam, Jane") #city manager
data_clean <-
raw_data %>%
# remove bad data
anti_join(removed_rows %>%
select(-removal_reason)) %>%
# create some flag variables
mutate(IsMasked = ifelse(str_detect(SentFrom, "\\[*\\]"), T, F),
IsReply = ifelse(str_detect(EmailSubject, "^R[E,e]:"), T, F),
FromCCM = ifelse(str_detect(SentFrom,
pattern = str_c(city_council_members, collapse = "|")),
T,
F),
ToCCM = ifelse(str_detect(SentTo,
pattern = str_c(city_council_members, collapse = "|")),
T,
F),
FromCO = ifelse(str_detect(SentFrom,
pattern = str_c(city_officials, collapse = "|")),
T,
F)) %>%
# format strings
mutate_at(vars(EmailSubject), str_to_lower) %>%
mutate(ReceivedDate = as_date(ReceivedDate))
## Joining, by = c("SentFrom", "SentTo", "SentCC", "ReceivedDate", "EmailSubject", "PlainTextBody", "MessageIdentifier")
removed_rows %>%
group_by(removal_reason) %>%
summarise(n_rows_removed = n()) %>%
ungroup() %>%
format_table()
| Removal reason | N rows removed |
|---|---|
| no_reply | 302 |
The meat of this data set is the plain email text body. We’ll get to that later. First, let’s look at the meta data about whose sending the emails and when they are sending them.
My hypothesis is that there aren’t too many unique senders. Unfortunately public engagement is low in general. As expected, we see a small handful having set the majority of emails.
frequency_email <- data_clean %>%
count(SentFrom, FromCCM, FromCO) %>%
mutate(FromCCM = case_when(FromCCM ~ "City Council Member",
FromCO ~ "City Official",
T ~ "Constituent")) %>%
arrange(desc(n)) %>%
mutate(SentFrom = as_factor(SentFrom))
frequency_email %>%
plot_ly(x = ~SentFrom, y = ~n, color = ~FromCCM, type = "bar")
If we look at some descriptive statistics about whose sending emails and their frequency, we get the following:
frequency_email %>%
summarise(total_observations = sum(n),
mean = mean(n),
median = median(n),
mode = getmode(n),
max = max(n),
min = min(n)) %>%
format_table()
| Total observations | Mean | Median | Mode | Max | Min |
|---|---|---|---|---|---|
| 5180 | 2.060461 | 1 | 1 | 109 | 1 |
As expected, this is a very skewed distribution with most only sending one email and one individual having sent 109.
Looking at the dates that emails get sent, it appears that there are outside triggers that incent people to send emails to the city council on certain days. Without looking at the calandar for city council sessions, it is a good guess that the days with the most email traffic are days prior and post city council sessions.
data_clean %>%
count(ReceivedDate) %>%
plot_ly(x = ~ReceivedDate, y = ~n, type = 'scatter', mode = 'lines')
Based on reply signs in the email subject, we can also see how engaged the council or city officials are with constituents:
data_clean %>%
count(IsReply) %>%
filter(!is.na(IsReply)) %>%
mutate(IsReply = ifelse(IsReply, "Response Email", "Original Email")) %>%
plot_ly(x = ~IsReply, y = ~n, type = "bar")
This is a bit low, I would expect more engagement since most emails are relatively unique and not campaigns by interest groups. Let’s see what types of trends we can find based on when city officials reply.
People frequently copy paste an email in order to show more public awareness to an issue. We are checking by email subject for now. As expected, there are some movements in Boulder that coordinate the efforts through this email channel.
data_clean %>%
filter(!IsReply) %>%
count(EmailSubject) %>%
arrange(desc(n)) %>%
mutate(EmailSubject = as_factor(EmailSubject)) %>%
plot_ly(x = ~EmailSubject, y = ~n, type = "bar")
We are starting to see some trends here. This is not conclusive, but gives us a good starting point for further analysis. There are a couple of notable trends. Mostly surounding Boulder’s Open Space Mountain Parks, Bike Trails, Vaping, Homeless Shelters, and 5g:
data_clean %>%
filter(!IsReply) %>%
count(EmailSubject) %>%
arrange(desc(n)) %>%
head(10) %>%
format_table()
| Emailsubject | N |
|---|---|
| please protect boulder open space | 72 |
| help us curb youth tobacco and nicotine use in boulder | 63 |
| vaping | 20 |
| balance is needed on osbt! | 19 |
| vaping ordinance | 19 |
| long family farm | 17 |
| alpine balsam | 16 |
| alpine-balsam | 16 |
| marpa house | 16 |
| prairie dogs | 16 |
This is a good start. Let’s first bin the emails into categories by extracting common phrases
tidy_text <- data_clean %>%
unnest_tokens(word, EmailSubject) %>%
anti_join(stop_words, by = "word")
tidy_text %>%
count(word) %>%
filter(!(word %in% c("boulder", "council", "city", 'fw', "fwd", 2019))) %>% # remove some meaningless words
with(wordcloud(word, n, max.words = 50))
## Warning in wordcloud(word, n, max.words = 50): protect could not be fit on
## page. It will not be plotted.
tidy_phrases <-
data_clean %>%
unnest_tokens(word, PlainTextBody, token = "ngrams", n = 2) %>%
separate(word, into = c("w_1", "w_2"), remove = F) %>%
anti_join(stop_words, by = c("w_1" = "word")) %>%
anti_join(stop_words, by = c("w_2" = "word")) %>%
count(word) %>%
arrange(desc(n))
## Warning: Expected 2 pieces. Additional pieces discarded in 57337 rows [2,
## 3, 10, 11, 95, 96, 110, 111, 112, 113, 126, 127, 130, 131, 213, 214, 304,
## 305, 313, 314, ...].
tidy_phrases
## # A tibble: 145,898 x 2
## word n
## <chr> <int>
## 1 city council 4063
## 2 boulder city 1492
## 3 boulder county 1155
## 4 http tinyurl.com 883
## 5 council 6bb18bf32d5a8f2798e95f9c79a8dbd48d5a1350ba840a320e2eb050e… 866
## 6 police department 811
## 7 boulder police 761
## 8 dear council 737
## 9 prairie dogs 727
## 10 south boulder 723
## # … with 145,888 more rows